FYI

  • janitor::chisq.test() and janitor::fisher.test() mask the versions in {stats}
  • After loading {janitor} if you call chisq.test() without specifying which package, it will default to the {janitor} package (loaded in last)

Data

  • Using three data sets which we have “messed up” in various ways (all from Kaggle)
    • Species
    • Characters
    • Films

Species

id name classification designation µ_height skin_colors hAiR_cOlOrS eye_colors µ_lifespan language hOmEwOrLd
1 Human Mammal Sentient 1.80 Light, Dark Various Various 79 Galactic Basic Various
2 Yoda's species Unknown Sentient 0.66 Green White Brown 900 Galactic Basic Unknown
3 Wookiee Mammal Sentient 2.28 Brown Brown Blue 400 Shyriiwook Kashyyyk
4 Gungan Amphibian Sentient 1.96 Orange None Orange 70 Gungan Naboo
5 Twi'lek Mammal Sentient 1.80 Blue, Green, Red, Yellow None Various 80 Twi'leki Ryloth
6 Mon Calamari Amphibian Sentient 1.70 Orange None Red 70 Mon Calamarian Mon Cala
7 Rodian Reptilian Sentient 1.70 Green None Black 78 Rodese Rodia
8 Droid Artificial Non-sentient
Various None Various
Binary None
9 Trandoshan Reptilian Sentient 2.00 Green None Red 70 Dosh Trandosha
10 Kel Dor Mammal Sentient 1.88 Orange None Black 70 Kel Dor Dorin

Characters

id name species gender height weight hair_color eye_color skin_color year_born homeworld year_died description
1 Luke Skywalker Human Male 1.72 77 Blond Blue Light 19 Tatooine 34 The main protagonist of the original trilogy.
2 Leia Organa Human Female 1.50 49 Brown Brown Light 19 Alderaan 35 A leader in the Rebel Alliance and twin sister of Luke Skywalker.
3 Darth Vader Human Male 2.02 136 None Yellow Pale 41 Tatooine 4 The Sith Lord formerly known as Anakin Skywalker.
4 Yoda Yoda's species Male 0.66 17 White Brown Green 896 Unknown 4 A wise and powerful Jedi Master.
5 Han Solo Human Male 1.80 80 Brown Hazel Light 29 Corellia 34 A smuggler turned hero in the Rebel Alliance.
6 Chewbacca Wookiee Male 2.28 112 Brown Blue Brown 200 Kashyyyk
A loyal Wookiee and co-pilot of the Millennium Falcon.
7 Obi-Wan Kenobi Human Male 1.82 81 White Blue-gray Light 57 Stewjon 0 A Jedi Master who mentored Anakin and Luke Skywalker.
8 Emperor Palpatine Human Male 1.73 75 None Yellow Pale 82 Naboo 35 The Sith Lord who became Emperor of the galaxy.
9 Anakin Skywalker Human Male 1.88 84 Blond Blue Light 41 Tatooine 4 A Jedi Knight who fell to the dark side and became Darth Vader.
10 Padmé Amidala Human Female 1.65 45 Brown Brown Light 46 Naboo 19 The queen-turned-senator of Naboo and wife of Anakin Skywalker.

Films

id title release_date director producer opening_crawl sl_opinion ck_opinion
1 Episode I: The Phantom Menace 36299 George Lucas Rick McCallum Turmoil has engulfed the Galactic Republic... good
2 Episode II: Attack of the Clones 37392 George Lucas Rick McCallum There is unrest in the Galactic Senate... good
3 Episode III: Revenge of the Sith 38491 George Lucas Rick McCallum War! The Republic is crumbling under attacks... good
4 Episode IV: A New Hope 28270 George Lucas Gary Kurtz, Rick McCallum It is a period of civil war... good
5 Episode V: The Empire Strikes Back 29362 Irvin Kershner Gary Kurtz, Rick McCallum It is a dark time for the Rebellion... good
6 Episode VI: Return of the Jedi 30461 Richard Marquand Howard G. Kazanjian, George Lucas, Rick McCallum Luke Skywalker has returned to his home planet... good
7 Episode VII: The Force Awakens 42356 J.J. Abrams Kathleen Kennedy, J.J. Abrams, Bryan Burk Luke Skywalker has vanished...

8 Episode VIII: The Last Jedi 43084 Rian Johnson Kathleen Kennedy, Ram Bergman The FIRST ORDER reigns... good
9 Episode IX: The Rise of Skywalker 43819 J.J. Abrams Kathleen Kennedy, J.J. Abrams, Michelle Rejwan The dead speak!...

10 Rogue One: A Star Wars Story 42720 Gareth Edwards Kathleen Kennedy, Allison Shearmur, Simon Emanuel The Rebellion makes a risky move... good
11 Solo: A Star Wars Story 43245 Ron Howard Kathleen Kennedy, Allison Shearmur, Simon Emanuel Board the Millennium Falcon and journey to a galaxy far, far away... good

clean_names()

Cleans names of an object (usually a data.frame)

names(species_messy)
 [1] "id"             "name"           "classification" "designation"   
 [5] "µ_height"       "skin_colors"    "hAiR_cOlOrS"    "eye_colors"    
 [9] "µ_lifespan"     "language"       "hOmEwOrLd"     


species_messy |> 
  clean_names(
    case = "snake", #default case
  ) |> 
  names()
Warning in warn_micro_mu(string = string, replace = replace): Watch out!  The mu or micro symbol is in the input string, and may have been converted to 'm' while 'u' may have been expected.  Consider adding the following to the `replace` argument:
The following characters are in the names to clean but are not replaced: \u00b5
 [1] "id"               "name"             "classification"   "designation"     
 [5] "m_height"         "skin_colors"      "h_ai_r_c_ol_or_s" "eye_colors"      
 [9] "m_lifespan"       "language"         "h_om_ew_or_ld"   

clean_names() continued

species_messy |> 
  clean_names(
    case = "snake", #default case
  ) |> 
  names()
 [1] "id"               "name"             "classification"   "designation"     
 [5] "m_height"         "skin_colors"      "h_ai_r_c_ol_or_s" "eye_colors"      
 [9] "m_lifespan"       "language"         "h_om_ew_or_ld"   


  • the replace argument: a named character vector where the name is replaced by the vector
species_messy |> 
  clean_names(
    case = "snake", # default case
    replace = c("hAiR" = "hair", "cOlOrS" = "colors", "hOmEwOrLd" = "homeworld", 
                janitor:::mu_to_u) 
  ) |> 
  names()
 [1] "id"             "name"           "classification" "designation"   
 [5] "u_height"       "skin_colors"    "hair_colors"    "eye_colors"    
 [9] "u_lifespan"     "language"       "homeworld"     

clean_names() continued (again)

species_messy |> 
  clean_names(
    case = "snake", #default case
  ) |> 
  names()
 [1] "id"               "name"             "classification"   "designation"     
 [5] "m_height"         "skin_colors"      "h_ai_r_c_ol_or_s" "eye_colors"      
 [9] "m_lifespan"       "language"         "h_om_ew_or_ld"   


BONUS To make the code more tidy, you can change the parsing option, but it will override the case= argument.

species_messy |> 
  clean_names(
    case = "snake",
    parsing_option = 0,
    replace = janitor:::mu_to_u
  ) |> 
  names()
 [1] "id"             "name"           "classification" "designation"   
 [5] "u.height"       "skin.colors"    "hair.colors"    "eye.colors"    
 [9] "u.lifespan"     "language"       "homeworld"     

make_clean_names()

Returns a clean character vector

species_messy |> 
  select(name) |> 
  mutate(
    NAME_CLEAN = make_clean_names(name)
  )
# A tibble: 40 × 2
   name           NAME_CLEAN   
   <chr>          <chr>        
 1 Human          human        
 2 Yoda's species yodas_species
 3 Wookiee        wookiee      
 4 Gungan         gungan       
 5 Twi'lek        twilek       
 6 Mon Calamari   mon_calamari 
 7 Rodian         rodian       
 8 Droid          droid        
 9 Trandoshan     trandoshan   
10 Kel Dor        kel_dor      
# ℹ 30 more rows

get_dupes()

Shows the duplicated rows according to a certain column (or multiple columns).

characters_messy |> 
  get_dupes(name) |> 
  select(name, dupe_count, species, height, weight, homeworld) |> 
  print(n = Inf)
# A tibble: 28 × 6
   name                 dupe_count species height weight homeworld
   <chr>                     <int> <chr>    <dbl>  <dbl> <chr>    
 1 Chirrut Îmwe                  3 Human     1.75     70 Jedha    
 2 Chirrut Îmwe                  3 Human     1.75     70 Jedha    
 3 Chirrut Îmwe                  3 Human     1.75     70 Jedha    
 4 Saw Gerrera                   3 Human     1.78     75 Onderon  
 5 Saw Gerrera                   3 Human     1.78     75 Onderon  
 6 Saw Gerrera                   3 Human     1.78     75 Onderon  
 7 Baze Malbus                   2 Human     1.85     95 Jedha    
 8 Baze Malbus                   2 Human     1.85     95 Jedha    
 9 Bodhi Rook                    2 Human     1.8      75 Jedha    
10 Bodhi Rook                    2 Human     1.8      75 Jedha    
11 Cassian Andor                 2 Human     1.84     75 Fest     
12 Cassian Andor                 2 Human     1.84     75 Fest     
13 Ezra Bridger                  2 Human     1.75     70 Lothal   
14 Ezra Bridger                  2 Human     1.75     70 Lothal   
15 Galen Erso                    2 Human     1.83     75 Grange   
16 Galen Erso                    2 Human     1.83     75 Grange   
17 Grand Admiral Thrawn          2 Chiss     1.95     85 Csilla   
18 Grand Admiral Thrawn          2 Chiss     1.95     85 Csilla   
19 Hera Syndulla                 2 Twi'lek   1.78     55 Ryloth   
20 Hera Syndulla                 2 Twi'lek   1.78     55 Ryloth   
21 Jyn Erso                      2 Human     1.7      60 Vallt    
22 Jyn Erso                      2 Human     1.7      60 Vallt    
23 Kanan Jarrus                  2 Human     1.91     86 Coruscant
24 Kanan Jarrus                  2 Human     1.91     86 Coruscant
25 Orson Krennic                 2 Human     1.85     80 Lexrul   
26 Orson Krennic                 2 Human     1.85     80 Lexrul   
27 Sabine Wren                   2 Human     1.65     50 Mandalore
28 Sabine Wren                   2 Human     1.65     50 Mandalore

convert_to_date()

  • Can handle mixed types (“2024-10-01” [character] vs 344402 [numeric])
  • Will handle Excel-specific date numbers (through excel_numeric_to_date())
films_messy |> 
  mutate(release_date_clean = convert_to_date(release_date),
         release_data_lub = lubridate::as_date(release_date)) |> 
  select(title, release_date, release_date_clean, release_data_lub)
# A tibble: 11 × 4
   title                        release_date release_date_clean release_data_lub
   <chr>                               <dbl> <date>             <date>          
 1 Episode I: The Phantom Mena…        36299 1999-05-19         2069-05-20      
 2 Episode II: Attack of the C…        37392 2002-05-16         2072-05-17      
 3 Episode III: Revenge of the…        38491 2005-05-19         2075-05-21      
 4 Episode IV: A New Hope              28270 1977-05-25         2047-05-27      
 5 Episode V: The Empire Strik…        29362 1980-05-21         2050-05-23      
 6 Episode VI: Return of the J…        30461 1983-05-25         2053-05-26      
 7 Episode VII: The Force Awak…        42356 2015-12-18         2085-12-19      
 8 Episode VIII: The Last Jedi         43084 2017-12-15         2087-12-17      
 9 Episode IX: The Rise of Sky…        43819 2019-12-20         2089-12-21      
10 Rogue One: A Star Wars Story        42720 2016-12-16         2086-12-18      
11 Solo: A Star Wars Story             43245 2018-05-25         2088-05-26      

convert_to_date() continued

Let’s say you have annoying date formats mixed together:

## simulating because I am incapable of making an Excel THAT messed up
release_dates <- as.character(films_messy$release_date)
release_dates[4] <- "5/25/1977"
release_dates
 [1] "36299"     "37392"     "38491"     "5/25/1977" "29362"     "30461"    
 [7] "42356"     "43084"     "43819"     "42720"     "43245"    


## convert mm/dd/yyyy to yyyy-mm-dd format
release_dates2 <- if_else(str_detect(release_dates, "/"),
                          as.character(lubridate::mdy(release_dates)),
                          release_dates)
release_dates2
 [1] "36299"      "37392"      "38491"      "1977-05-25" "29362"     
 [6] "30461"      "42356"      "43084"      "43819"      "42720"     
[11] "43245"     


convert_to_date(release_dates2)
 [1] "1999-05-19" "2002-05-16" "2005-05-19" "1977-05-25" "1980-05-21"
 [6] "1983-05-25" "2015-12-18" "2017-12-15" "2019-12-20" "2016-12-16"
[11] "2018-05-25"

remove_empty()

Removes empty columns from a data frame

films_messy |> 
  select(title, director, sl_opinion, ck_opinion)
# A tibble: 11 × 4
   title                              director         sl_opinion ck_opinion
   <chr>                              <chr>            <chr>      <lgl>     
 1 Episode I: The Phantom Menace      George Lucas     good       NA        
 2 Episode II: Attack of the Clones   George Lucas     good       NA        
 3 Episode III: Revenge of the Sith   George Lucas     good       NA        
 4 Episode IV: A New Hope             George Lucas     good       NA        
 5 Episode V: The Empire Strikes Back Irvin Kershner   good       NA        
 6 Episode VI: Return of the Jedi     Richard Marquand good       NA        
 7 Episode VII: The Force Awakens     J.J. Abrams      <NA>       NA        
 8 Episode VIII: The Last Jedi        Rian Johnson     good       NA        
 9 Episode IX: The Rise of Skywalker  J.J. Abrams      <NA>       NA        
10 Rogue One: A Star Wars Story       Gareth Edwards   good       NA        
11 Solo: A Star Wars Story            Ron Howard       good       NA        


films_messy |> 
  select(title, director, sl_opinion, ck_opinion) |> 
  remove_empty("cols")
# A tibble: 11 × 3
   title                              director         sl_opinion
   <chr>                              <chr>            <chr>     
 1 Episode I: The Phantom Menace      George Lucas     good      
 2 Episode II: Attack of the Clones   George Lucas     good      
 3 Episode III: Revenge of the Sith   George Lucas     good      
 4 Episode IV: A New Hope             George Lucas     good      
 5 Episode V: The Empire Strikes Back Irvin Kershner   good      
 6 Episode VI: Return of the Jedi     Richard Marquand good      
 7 Episode VII: The Force Awakens     J.J. Abrams      <NA>      
 8 Episode VIII: The Last Jedi        Rian Johnson     good      
 9 Episode IX: The Rise of Skywalker  J.J. Abrams      <NA>      
10 Rogue One: A Star Wars Story       Gareth Edwards   good      
11 Solo: A Star Wars Story            Ron Howard       good      

remove_constant()

Removes columns with one constant value

films_messy |> 
  select(title, director, sl_opinion, ck_opinion) |>
  remove_constant()
# A tibble: 11 × 3
   title                              director         sl_opinion
   <chr>                              <chr>            <chr>     
 1 Episode I: The Phantom Menace      George Lucas     good      
 2 Episode II: Attack of the Clones   George Lucas     good      
 3 Episode III: Revenge of the Sith   George Lucas     good      
 4 Episode IV: A New Hope             George Lucas     good      
 5 Episode V: The Empire Strikes Back Irvin Kershner   good      
 6 Episode VI: Return of the Jedi     Richard Marquand good      
 7 Episode VII: The Force Awakens     J.J. Abrams      <NA>      
 8 Episode VIII: The Last Jedi        Rian Johnson     good      
 9 Episode IX: The Rise of Skywalker  J.J. Abrams      <NA>      
10 Rogue One: A Star Wars Story       Gareth Edwards   good      
11 Solo: A Star Wars Story            Ron Howard       good      

remove_constant() continued

Can also specify whether NA’s should be considered different from a constant value

films_messy |> 
  select(title, director, sl_opinion, ck_opinion) |>
  remove_constant(
    na.rm = TRUE # default is FALSE
  )
# A tibble: 11 × 2
   title                              director        
   <chr>                              <chr>           
 1 Episode I: The Phantom Menace      George Lucas    
 2 Episode II: Attack of the Clones   George Lucas    
 3 Episode III: Revenge of the Sith   George Lucas    
 4 Episode IV: A New Hope             George Lucas    
 5 Episode V: The Empire Strikes Back Irvin Kershner  
 6 Episode VI: Return of the Jedi     Richard Marquand
 7 Episode VII: The Force Awakens     J.J. Abrams     
 8 Episode VIII: The Last Jedi        Rian Johnson    
 9 Episode IX: The Rise of Skywalker  J.J. Abrams     
10 Rogue One: A Star Wars Story       Gareth Edwards  
11 Solo: A Star Wars Story            Ron Howard      

Rounding in {base} vs rounding in {janitor}

Selecting some data to work with…

characters_messy |> 
  select(name, height) |> 
  filter(name %in% c("Leia Organa", "Pong Krell"))
# A tibble: 2 × 2
  name        height
  <chr>        <dbl>
1 Leia Organa    1.5
2 Pong Krell     2.5

Rounding in {base} vs rounding in {janitor}

base::round(1.5)
[1] 2


janitor::round_half_up(1.5)
[1] 2


base::round(2.5)
[1] 2


janitor::round_half_up(2.5)
[1] 3

round_half_up()

Rounds up any .5 (“typical” rounding convention)

characters_messy |> 
  select(name, height) |> 
  mutate(
    HEIGHT_HALFUP = round_half_up(height)
  )
# A tibble: 96 × 3
   name              height HEIGHT_HALFUP
   <chr>              <dbl>         <dbl>
 1 Luke Skywalker      1.72             2
 2 Leia Organa         1.5              2
 3 Darth Vader         2.02             2
 4 Yoda                0.66             1
 5 Han Solo            1.8              2
 6 Chewbacca           2.28             2
 7 Obi-Wan Kenobi      1.82             2
 8 Emperor Palpatine   1.73             2
 9 Anakin Skywalker    1.88             2
10 Padmé Amidala       1.65             2
# ℹ 86 more rows

signif_half_up()

Similar to round_half_up() but can specify number of significant digits

characters_messy |> 
  select(name, height) |> 
  mutate(
    HEIGHT_SIGFIG = signif_half_up(height, digits = 2)
  )
# A tibble: 96 × 3
   name              height HEIGHT_SIGFIG
   <chr>              <dbl>         <dbl>
 1 Luke Skywalker      1.72          1.7 
 2 Leia Organa         1.5           1.5 
 3 Darth Vader         2.02          2   
 4 Yoda                0.66          0.66
 5 Han Solo            1.8           1.8 
 6 Chewbacca           2.28          2.3 
 7 Obi-Wan Kenobi      1.82          1.8 
 8 Emperor Palpatine   1.73          1.7 
 9 Anakin Skywalker    1.88          1.9 
10 Padmé Amidala       1.65          1.7 
# ℹ 86 more rows

round_to_fraction()

Rounds to the nearest fraction

characters_messy |> 
  select(name, height) |> 
  mutate(
    HEIGHT_FRACTION = round_to_fraction(height, denominator = 4)
  )
# A tibble: 96 × 3
   name              height HEIGHT_FRACTION
   <chr>              <dbl>           <dbl>
 1 Luke Skywalker      1.72            1.75
 2 Leia Organa         1.5             1.5 
 3 Darth Vader         2.02            2   
 4 Yoda                0.66            0.75
 5 Han Solo            1.8             1.75
 6 Chewbacca           2.28            2.25
 7 Obi-Wan Kenobi      1.82            1.75
 8 Emperor Palpatine   1.73            1.75
 9 Anakin Skywalker    1.88            2   
10 Padmé Amidala       1.65            1.75
# ℹ 86 more rows

compare_df_cols()

Used to see whether column classes conflict and prevent row binding

compare_df_cols(characters_human_messy, characters_nonhuman_messy)
    column_name characters_human_messy characters_nonhuman_messy
1           bmi              character                 character
2  death_status              character                 character
3   description              character                 character
4     eye_color              character                 character
5        gender              character                 character
6    hair_color              character                 character
7        height                numeric                   numeric
8     homeworld              character                 character
9            id                numeric                   numeric
10         name              character                 character
11   skin_color              character                 character
12      species              character                 character
13       weight                numeric                   numeric
14    year_born                numeric                   numeric
15    year_died                numeric                   numeric

compare_df_cols() continued

characters_human_extra_messy <-
  characters_human_messy |> 
  mutate(
    across(everything(), ~as.character(.x))
  )


compare_df_cols(characters_human_extra_messy, characters_nonhuman_messy)
    column_name characters_human_extra_messy characters_nonhuman_messy
1           bmi                    character                 character
2  death_status                    character                 character
3   description                    character                 character
4     eye_color                    character                 character
5        gender                    character                 character
6    hair_color                    character                 character
7        height                    character                   numeric
8     homeworld                    character                 character
9            id                    character                   numeric
10         name                    character                 character
11   skin_color                    character                 character
12      species                    character                 character
13       weight                    character                   numeric
14    year_born                    character                   numeric
15    year_died                    character                   numeric

compare_df_cols_same()

Returns TRUE or FALSE saying if the dataframes can successfully row bind

compare_df_cols_same(characters_human_messy, characters_nonhuman_messy)
[1] TRUE


compare_df_cols_same(characters_human_extra_messy, characters_nonhuman_messy)
  column_name       ..1     ..2
1      height character numeric
2          id character numeric
3      weight character numeric
4   year_born character numeric
5   year_died character numeric
[1] FALSE

tabyl() basics

tabyl() is halfway between table() and gtsummary::cross_tab()

## one variable
characters_messy |> 
  tabyl(death_status)
 death_status  n percent
        alive 36   0.375
         dead 60   0.625


## two variables
characters_messy |> 
  tabyl(death_status, bmi)
 death_status BMI High BMI Low NA_
        alive       11      22   3
         dead       26      33   1

Note: you can untabyl() a tabyl to remove all tabyl properties

adorn_*()

Functions to make your “tabyls” more readable:

  • adorn_title() – add column name to the top of a two-way table
  • adorn_percentages() – converts tabyl of counts to percentages
  • adorn_totals() – add row/column totals to a tabyl


Functions that come after using adorn_percentages()

  • adorn_pct_formatting() – formatting decimals as percentages
  • adorn_ns() – add underlying Ns to a tabyl with percentages
  • adorn_rounding() – round numeric columns

Adorn with totals and titles

characters_messy |> 
  tabyl(death_status, bmi, show_na = FALSE) |> 
  adorn_totals() |> 
  adorn_title()
                   bmi        
 death_status BMI High BMI Low
        alive       11      22
         dead       26      33
        Total       37      55

Adorn with percentages

characters_messy |> 
  tabyl(death_status, bmi, show_na = FALSE) |> 
  adorn_percentages() |> 
  adorn_pct_formatting()
 death_status BMI High BMI Low
        alive    33.3%   66.7%
         dead    44.1%   55.9%


characters_messy |> 
  tabyl(death_status, bmi, show_na = FALSE) |> 
  adorn_percentages() |> 
  adorn_pct_formatting() |> 
  adorn_ns()
 death_status   BMI High    BMI Low
        alive 33.3% (11) 66.7% (22)
         dead 44.1% (26) 55.9% (33)

Adorn with titles, totals, percentages

Be careful of the order these functions are in…

  • adorn_totals() must go before all other adorn_*() functions (gives somewhat helpful error)
  • adorn_title() must go after adorn_percentages() (will give nonsense error)
characters_messy |> 
  tabyl(death_status, bmi, show_na = FALSE) |> 
  adorn_totals() |>
  adorn_percentages() |> 
  adorn_pct_formatting() |> 
  adorn_title()
                   bmi        
 death_status BMI High BMI Low
        alive    33.3%   66.7%
         dead    44.1%   55.9%
        Total    40.2%   59.8%

Takeaways

  • Function masking is something to be aware of!
  • Base R may not be rounding like you think it is…
  • Some useful functions from {janitor}:
    • clean_names() (in every 10-setup_XXX.qmd)
    • get_dupes()
    • convert_to_date()
    • compare_df_cols() / compare_df_cols_same()